Machine learning–based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease

Abstract Background Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning–based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance. Results We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray–Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. Multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations. Conclusion Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.


Background
Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states.To have a concrete selection among a high number of features, Recursive Feature Elimination (RFE) has been widely used in the bioinformatics field.However, machine learning based RFE has factors that decrease the stability of feature selection.In this paper, we suggested methods to improve stability while sustaining performance.

Results
We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy control (1569 samples).We found that applying an already published data transformation before RFE improves feature stability significantly.Moreover, we performed an in-depth evaluation of different variants of the data transformation and identify those that demonstrate better improvement in stability while not sacrificing classification performance.To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features.We were able to confirm that the mapping by the Bray-Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance.Multi-Layer Perceptron (MLP) algorithm exhibited the highest performance among eight different machine learning algorithms when a large number of features (a few hundred) were considered based on the best performance across 100 bootstrapped internal test sets.Conversely, when utilizing only a limited number of biomarkers as a tradeoff between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance.Using the optimal pipeline we developed, we identified fourteen biomarkers for IBD at the species level and analyzed their roles using SHapley Additive exPlanations.Conclusion Taken together our work showed not only how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance, but also provided useful insights for future comparative studies.However, the application of ML methods to microbiota might be challenging for a number of reasons there's a lack of standards in data structures, metadata collection, and pre-processing.This leads to limited generalizability [REF17#-REF18#] of the results in terms of diagnostic and prognostic biomarkers, i.e. taxa that could be used as diagnostic or prognostic markers [REF19#] and that, as such, need to be robust and reproducible indicators of the biological state.Each dataset was given as input to a "typical" ML pipeline, including splitting the data into test and training set, feature selection, and a final classification step.Since generalized biomarker discovery is a key point in this landscape, our effort focused on studying the feature selection step.In particular, we used Recursive Feature Elimination (RFE) within a bootstrap embedding [REF21#] as described in Section 2.2. to identify and select robust features.Moreover, the integration of prior knowledge in the learning process was investigated as a method to achieve higher stability of the biomarker list [REF22#].
A fourth external dataset for which the samples' labels were not available, was used for this purpose, to compute feature similarity.We referred to this procedure as mapping strategy, since it is based on a kernel-based data transformation that projects data in a new space where the more similar two features are, the closer they are mapped in the new space.The basic idea is to take into account the correlation of different taxa: if some features are strongly correlated, then they likely have similar importance and are equally relevant for the classification task.More details about the mapping strategy are provided in Section 2.3.In Section 2.4 we introduce all metrics used to assess features stability (robustness of taxa selected).
With the selected set of features, we used different classification approaches to classify IBD-affected patients and healthy controls, namely logistic regression, support vector machines, random forests, extreme gradient boosting, and neural networks.In Section 2.5 we introduce all the prediction methods used in our study.To assess the generalizability of the classifiers and the robustness of the microbial signature, the models developed using ED1 were tested on test1 (obtained by splitting ED1 in training and test in proportions 80% and 20% respectively) and on the entire ED2.Similarly, models developed using ED2 were tested on test2 and on the entire ED1.
Using the Bray-Curtis similarity matrix to map the data provides the best, improved stability, without sacrificing the classification performance.Using this pipeline, we selected the top 14 features as a tradeoff between optimal performance and method generalizability.The best-performing algorithm on this robust set of features was the Random Forest.We further investigated the role of these biomarkers using Shapley values To ensure the reproducibility of the results, datasets and all code written in support of this publication are publicly available at https://gitlab.com/sysbiobig/mlonmicrobiome.However, no standard approach has been identified yet.We pre-processed each dataset independently and then combined them in ED1 and ED2.First, all the taxa with the same taxonomy classification (i.e.

Data
same species level or genus level) were aggregated and their respective counts were summed, as usually done in microbiome studies [REF19#-REF33#-REF34#].We considered both species and genus levels independently, thus obtaining two abundance matrices for each dataset.After that, we filtered out those taxa with more than 99% of abundances equal to 0. It is worth noting that usually, in the field of microbiota analysis, a threshold of 95% or 90% is used.However, lower thresholds can alter the composition of the abundance profiles (i.e. the abundances of all the features in a sample) and have an effect on the following normalization step [REF35#].Then we divided the abundance profiles by the geometric mean and took the log (base 2) of the ratios leveraging on the idea of clr transformation, using a pseudocount value equal to the minimum observed data.
After finishing preprocessing Dataset#1+Dataset#2+Dataset#3 separately, we integrated the three datasets as shown in Fig. 1.To this purpose, from the entire set of features, only the ones common throughout all datasets were kept.In this way, we selected a core set of features with 283 taxa at the species level and 220 at the genus level.After datasets splitting in ED1 and ED2, min-max scaling was performed to scale features into the same range.The preprocessing was performed separately to the external dataset (Dataset #4) following the same steps described above.

Feature Selection Approach
For each taxonomic level (genus, species) and each dataset ED1 and ED2, Recursive Feature Elimination (RFE) was used as a feature ranking method to obtain the optimal number of features [REF21#] as described in the following.RFE was performed 100 times with bootstrapping.In each bootstrap, the training dataset was split into an internal training set and an internal test set.A prediction model was trained within the internal training set using Linear Support Vector Machine (SVM).
Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.Linear SVMs were chosen because of their cost-efficiency tradeoff and their ability to deal with a high number of predictors as the performance index.Feature importance was measured based on the feature weight since data had been standardized in input.The least important feature was eliminated from the dataset iterating the process until only one feature was left.Sorting the features from the last eliminated, which will therefore have rank 1, and averaging the ranked lists across the 100 bootstraps, the global feature rank was calculated.Finally, MCC was computed independently on each bootstrap internal test sets for different numbers of features and then averaged across the 100 bootstraps.The number of features corresponding to the maximum MCC (MCC shows either a peak or a saturation effect) was taken as optimum.
RFE was performed i) without any transformation; ii) with mapping performed using Pearson correlation; iii) with mapping performed using Bray-Curtis similarity, as explained in the following paragraph.

Similarity Matrix
The similarity matrix is a symmetric matrix encoding the knowledge about the likeness between features.
In Eq. ( 1) It should be noted that, to avoid introducing any bias, the similarity matrices were calculated on dataset 4 (see Fig. 1 and Table 1), an external dataset used only to calculate the similarity matrix to perform the mapping transformation (see paragraph 2.3.1).

Mapping transformation
Information about feature correlation is integrated by mapping data using a kernel transformation that has been shown to possibly alleviate feature instability [REF22#].Transformation matrix  , is obtained using the equation  =  −1 ( + ( − )) , where  is the similarity matrix,  is the diagonal matrix whose elements are the sum of the elements in the rows of the matrix  + ( − ) and  is a tuning parameter.The value of  was decided for each experiment using 5-fold cross-validation within its internal training dataset using a grid of 0.01 and from 0.05 to 1 by step 0.05.In our approach, mapping was used only in the RFE step.Since the ranks are distinct integers, SRCC between two rank sets can be calculated as follows:

Stability
Where d is the difference between the two ranks of each feature and n is the total number of ranked features.
Hamming Distance between two rank sets is calculated as the proportion of disagreeing components as follows: Euclidian distance is the length of the line segment connecting two different points.The stability indexes described above were computed for each pair of bootstrap samples used for RFE and finally averaged across the 100*(100-1)/2 values.
The number of commonly ranked features was defined by counting the number of common features across all bootstraps considering a range that starts from the top rank single feature and expanding the range one by one.The condition can be varied by defining 'common' as consistent across all 100 bootstraps or across at least 66 or 50 bootstraps.
To compare the ability to distinguish unimportant features from important features, a noise filtering experiment was performed.We added 100 randomly generated noisy features (range 0 to 1 from a uniform distribution) to the original features (range 0 to 1 by min-max scaler) before mapping and after mapping and RFE compared the average rank of noisy features from the average rank of true features (Table 3).With a good noise filtering algorithm, noise features are expected to have big feature ranks.

Performance Measurement
MCC was used for the step of Recursive Feature Elimination and to compare the performance of the different models.MCC estimates the correlation between the predictive value and ground truth value by using the following equation: where TP is the number of  1), we set the threshold to 0.5.

Classification Model Algorithms
Eight different types of prediction algorithms, namely logistic regression, linear SVM, random forest, XGBoost, Perceptron, and Multi-Layer Perceptron (MLP) with 1, 2 or 3 hidden layers, were used to classify samples in IBD vs. healthy using the features selected within the RFE phase.Within each training dataset, a classification model which predicts the status of IBD was built.Hyperparameters in each model were tuned using 5-fold cross-validation within the training set with grid search.
Logistic Regression outputs prediction scores between 0 and 1 by applying logistic function to linear regression.In this study, regularized logistic regression, using L2 penalty, i.e.Ridge Regression, was used.For the optimization, lbfgs solver was used with the tolerance parameter of 1e-4 (which followed the default setting of the library sklearn.linear_model.LogisticRegression by scikit-learn) [REF44#].
Linear SVM linearly separates data maximizing the margin between two classes.The squared L2 penalty was applied to prevent overfitting.'rfb' kernel was used, and 'gamma' was set as 'scale' so that it used 1 / (# of features * variation of features) as value of gamma (which followed the default setting of the library sklearn.svm.LinearSVC by scikit-learn) [REF44#].Regularization parameter was tuned with a grid of two to the power of (-5 to 15 with the step of 2).
Random Forest builds many decision trees and uses bagging to make an uncorrelated forest of trees combined for better prediction.Gini-impurity was used to decide the splits, while the maximum depth was unlimited so that nodes are expanded until all leaves are pure or contain less than the parameter 'min samples split'.The parameter 'min samples split', which is the minimum number of samples for splitting an internal node, and the parameter 'min samples leaf', which is the minimum of samples to be at a leaf node were set equal to two.The number of features to search for the best split was set as √# of features.Finally, the number of trees in the forest was set as 100.
Extreme gradient boosting (XGBoost) is a tree-ensemble model which utilizes gradient boosting to combine many trees.XGBRegressor was trained with 200 gradient boosted trees.Base learners had 20 maximum tree depth.Subsample ratio of the training instance was 0.2, while subsample ratio of columns when constructing each tree was 0.5.Minimum loss reduction which is for the partition on a leaf node, was set to 1. Learning rate was tuned with the grid of 0.005,0.01,0.05, 0.1, 0.5.Regularization factor, alpha, which is the L1 regularization term was with the grid of 1e-3, 1e-2, 0.1, 1,10.
Perceptron is a single layer neural network, predicting an output with combinations of input values, weights and bias.Multi-Layer Perceptron (MLP) is a feedforward neural network with multiple hidden layers.In the experiments, MLP Regressor which has a single output layer was built.LBFGS optimizer minimizing the squared error function was used to optimize weight variables, while ReLu function was utilized for the activation function.When the loss score was under 1e-4, the optimization was considered as converged, and the training stopped.Learning rate and L2 regularization term are tuned with the same procedure with XGBoost.MLP with 1,2,3 hidden layers were constructed with hidden layer size corresponding to the number of features.

Mapping Transformation and Recursive Feature Elimination
Hyperparameter  used in mapping (paragraph 2.3.2) decides how much we weigh the similarity matrix in data transformation when performing RFE and was tuned in cross-validation within each internal test set as explained in paragraph 2.3.2.(see Fig. 2).Interestingly, the MCC values exhibit different patterns when comparing mapping performed using Pearson Correlation and mapping performed using Bray Curtis Similarity.For mapping with Pearson Correlation, the MCC shows an unstable profile and a sudden drop as shown in Fig 2 .a and c.The optimal α value appears as the smallest value, 0.01, within the grid range of 0.01 to 1. On the other hand, Bray-Curtis Similarity mappings demonstrates an initial increase in the performance followed by a stable pattern of decline, with optimal α values equal to 0.15 and 0.05 at species level for ED1 and ED2, respectively; and equal to 0.05 at genus level for both ED1 and ED2.Performance dependence on the number of features is illustrated in Fig. 3, where MCC is shown i) with not mapping; ii) with mapping performed using Pearson correlation; iii) with mapping performed using Bray-Curtis similarity.The pattern is similar in the three cases, with MCC close to 0.8 slightly increasing with the number of features and saturating only at the very right side of the curve.As a consequence, the optimal number of features is high (87.1% of features are selected in average) in most of RFE experiments.the only exception is represented by mapping based on Pearson correlation on ED1 at species level, which reaches the maximum MCC when using 22 features.The maximum value of the MCC (average calculated on the 100 bootstrap samples) is shown in the label of Fig. 3 as 'Maximum MCC', with the corresponding number of features indicated as "Optimal feature number".

Stability Indexes
Table 2 illustrates the values of the stability metrics across the different bootstraps in each experiment.
At species level, mapping data using Bray-Curtis similarity allows reaching better stability with every dataset and every metric.At genus level, mapping data using Bray-Curtis similarity still shows the better stability in ED1 dataset (with the exception of Hamming distance and SRCC metrics) and in ED2 dataset (with the exception of Hamming distance metric).Supplementary Fig. 1 and 2 show the number of commonly ranked features across all 100 bootstraps or across at least 66 or 50 bootstraps.At small ranks, the difference between different mapping strategies is not clear as the number of common features across bootstraps is too small.However, as more features are considered, RFE with Bray Curtis similarity-based mapping shows a greater number of common features in every dataset except ED1 in genus level.

Noise Filtering
To assess the noise filtering ability of the various approaches (paragraph 2.4.1),we subtracted the average rank of noisy features from the average rank of original data features (283 features in species level, 220 features in genus level).In

Performance comparison by selected set of features
Prediction models were built based on the optimal set of features identified using linear SVM based RFE and Bray-Curtis-similarity based mapping (Fig. 3).To validate the methods, the models trained on ED1 were tested in both Test1-ED1 and on the entire ED2 and models trained on ED2 were tested on Test2-ED2 and on the entire ED1. 8 different prediction algorithms (see Section 2.5) were tested.
Overall performance obtained by different methods using different metrics (AUC, accuracy, sensitivity, specificity, PPV, NPV, MCC) are illustrated in Supplementary Table 1,2,3 and 4. Supplementary Table 5 summarizes the best MCC with best performing algorithm in each experiment.MLP with 1 or 2 hidden layer shows the best MCC in most of the experiments (23 among 24 results).
As the optimal number of features is high in most of RFE experiments (Fig. 3.), prediction models were also trained and compared with a constant number of a few top features.We choose to consider the top 14 features (see paragraph 3.4).Results obtained using different algorithms are shown in Supplementary Table 6.With smaller number of selected features, the performance slightly decreases and the best performing algorithm varies.Random Forest shows the best performance in the most cases (15 among 24), followed by MLP-1,2 hidden layer, XGBoost, Logistic Regression and Linear SVM.RFE without mapping shows better performance than RFE with mapping by with Bray-Curtis similarity with 0.0016 in average.Considering the small gap, mapping with Bray Curtis Similarity, while improving feature stability, does not seem to affect methods performance in terms of MCC, which overall, remains quite stable.As the optimal number of features is high in most of RFE experiments (Fig. 3.), prediction models were also trained and compared with a constant number of a few top features.We choose to consider the top 14 features as a tradeoff between optimal performance and generalizability potential, based on results

Full Data Experiment and Biomarker Selection
shown in Figure 4 where we calculated the MCC as the difference in MCC at different subsequent numbers of features and selected the value at which the MCC starts to converge to zero.Compared to models with optimal number of features (267), overall performance decreases and the best model algorithm changes.While MLP with 1 hidden layer is the best algorithm with MCC 0.963 in Supplementary Table 7, it decreases to 0.826.Instead, Random Forest decreases to 0.845 from 0.854, which is the maximum MCC among the eight algorithms.

Discussion
Stable feature selection is a prerequisite to decide strong biomarkers.To prove the validity of a set of features as biomarkers, they should be observed consistently every time a feature selection algorithm is applied.In this paper, we performed Recursive Feature Elimination multiple times using bootstrapping and checked the level of consistency of feature ranks between trials using stability indexes.Moreover, we investigated various strategies which could improve the stability of selected biomarkers.
First, we used bootstrap to cope with the scarcity of samples in the training set compared to the number of features [REF45#].In this way we generated various classifiers, selected various features on various data splits, and then averaged the results, preserving a high ranking only for those features that are consistently the most discriminating features across the splits.Second, since the high number of features makes the problem under-constrained, i.e. many possible sets of features can be considered relevant to the task and equally good in terms of accuracy, we used additional information from an external dataset (dataset 4) to include additional constraints in terms of feature mapping.In other words, if some features are strongly correlated, then they likely have similar importance, and we want them to be equally relevant for the classification task.
Three different datasets were merged to increase the number of examples and mitigate the potential batch effect, and then split in two.As the division of the dataset was performed using bootstrapping, the ratio of non-IBD vs IBD is preserved, and the ratio of the data source was also preserved without significant difference as shown in Supplementary Table 9.However, the distribution of each feature may differ as the size of the dataset was small compared to the number of features.
Mapping transformation with Linear SVM based RFE improved the stability without the loss of performance.We recommended using Bray-Curtis similarity to improve the stability of the selected features since, in most cases, mapping using Bray-Curtis similarity showed significant increase of stability compared to other approaches and higher noise filtering score, at both genus and species level.
Moreover, as shown in Supplementary Fig 1 and 2, mapping with Bray-Curtis similarity presents a higher number of common features, already at early stage of RFE.
Notably, the Bray-Curtis similarity was designed for compositional data (a composition is represented by a vector of proportions with respect to a total sum on the sample under observation) and is therefore more robust to spurious correlations introduced by the fact that the taxonomic abundances are multivariate observations whose information content is closely linked to the relationship between the components.
A possible drawback of our method is that the mapping transformation compresses the information by shortening the distance between similar features.In case the degree of the transformation is high ( in the transformation matrix equation  =  −1 ( + ( − ))), the loss of potential information can be important.Choosing  in cross-validation as done in this work, might help mitigating this risk although at a higher computational cost.Indeed, in our experiments the optimal choice of  did not highlight a loss in classification performance.
There are similar approaches that utilizes similarity matrix to map similar features into closer space To further assess the consistency of our results and the robustness of our approach, we calculated the FDR-adjusted p-values using the Wilcoxon rank-sum test, fold change and average, standard deviation information of the top 14 biomarkers we selected in Supplementary Table 11.We found that 13 of 14 biomarkers had a q-value below 0.001 when comparing between IBD and non-IBD samples.However, despite the statistically significant differences in the markers, 52.3% of the features showed significant differences between IBD and non-IBD groups, which implies that relying solely on statistical tests may not be sufficient in determining good biomarkers.The robustness of our algorithm and the excellence of machine learning algorithms are shown by high performing prediction for this sparse type of dataset.
It should be reported that we originally performed the overall experiment without logarithmic transformation (data not shown) and checked later that logarithmic transformation improved the performance significantly.We expect the logarithmic transformation has accounted for the highly skewed distribution of microbial abundances [REF49#].Before using the logarithmic transformation, the best performing algorithm was random forest instead of MLP.While tree-based ensembles utilize feature splitting based on a set of thresholds, neural network utilizes continuous values to build nonlinear relationships among variables, and logarithmic transformation is expected to advantage the performance in this aspect.
However, when fewer features are considered, random forest outperforms other methods including MLP.In recent studies, random forest provides generally higher performance than other conventional algorithms for microbial data analysis [REF33#-REF50#-REF51#-REF52#].We suppose the higher complexity of MLP algorithm takes more advantage when a higher number of features is considered; whereas, in case fewer features are considered, the 'split' approach of tree-based ensemble models may fit to microbial data better.
University of Padova School of Engineering: Universita degli Studi di Padova Scuola di Ingegneria Padova, ITALY 1. Introduction Next-Generation Sequencing technologies allow reconstructing the internal composition of the whole microbial community (microbiota) present in a sample possibly exploiting two different approaches: Whole Genome Shotgun sequencing (WGS) and targeted amplicon sequencing of 16S ribosomal RNA (16S rDNA-seq) [REF1#-REF2#].The first focuses on all genomes, while the second only on a region of a single gene, i.e. 16S rRNA gene.Different bioinformatics pre-processing pipelines can be used to obtain the so-called abundance matrix and taxonomy matrix [REF3#-REF4#] from raw-read data.The abundance matrix describes the relative abundance of different operating taxonomic units (OTUs) or of different Amplicon Sequence Variant (ASV) [REF5#-REF6#] on each sample, while the taxonomy matrix contains information about the taxonomy of each OTU or ASV, i.e. kingdom, phylum, class, order, family, genus, species.Machine learning (ML) can be used to classify samples based on their taxa composition, thus identifying a microbial signature that characterizes host phenotypes.Recently, several studies exploited different ML-based techniques to develop models for predicting disease states, such as inflammatory bowel disease [REF7#-REF8#], colorectal cancer [REF9#-REF10#], and cardiovascular disease [REF11#] using microbial data.This growing interest is mainly due to the potential impact on diagnosis and therapeutic target identification.

Fig. 1 .
Fig. 1.Diagram for the overall experiments The regularization parameter was tuned within each internal training set using grid search and 5-fold cross-validation with Matthew correlation coefficient (MCC) [REF39#] Different rank-based stability indexes were used to evaluate the robustness of the feature selection algorithm: Spearman's rank correlation coefficient (SRCC), Hamming Distance, Pearson Correlation, and Bray Curtis Dissimilarity [REF42#-REF43#].

Fig. 2 .
Fig. 2. Average MCC across bootstraps at different values of hyperparameter α in mapping transformation.Hyperparameter α is used for controlling the degree of mapping transformation.Panel a: species level, Mapping by Pearson Correlation.Panel b: species level, Mapping by Pearson Correlation.Panel c: genus level, Mapping by Pearson Correlation.Panel d: genus level, Mapping by Pearson Correlation

Fig. 3 .
Fig. 3. Average MCC across the 100 bootstrapped recursive feature elimination process.MCC(y-axis) at varying number of features(x-axis) averaged across the 100 bootstrap internal test sets.Maximum MCC with optimal feature number is written in each legend.Error bars represent the standard deviation of MCC.Panel a: species level, ED1.Panel b: species level, ED2.Panel c: genus level, ED1.Panel d: genus level, ED2.

Fig. 4 .
Fig. 4. Differential of average MCC during RFE.Average MCC is calculated in each feature number, and each average MCC is subtracted by its former average MCC to calculate differentials.x-axis: the number of features, y-axis: differential of average MCC, horizontal red line: y=0.

Fig. 5 .
Fig. 5. SHAP summary plot from Random Forest model trained by top fourteen features in RFE (species level, SHAP values from training dataset).

[
REF46#][REF47#].AggMapNet maps the original data into multi-channel 2D spatial-correlated images based on their pairwise correlation distances using the manifold learning method Uniform Manifold Approximation and Projection (UMAP) [REF48#].Different channels are chosen based on a preliminary clustering step, which is again based on pairwise correlation distances between features.Features maps are then given as input to machine learning methods such as convolutional neural networks to perform classification.Results obtained using AggMapNet [REF46#][REF47#] approach on our dataset are shown in Supplementary Table10and compared with our pipeline (Bray-Curtis similarity-based mapping + RFE + MLP).This comparison proves the strength of our pipeline in the situation of small data and a high number of features, which is typical of omics studies.Differently from AggMapNet [REF46#][REF47#], our pipeline does not project the data in multi-channel 2D spatial-correlated images, thus it is not suitable for 2D data representation and omics data integration.

Table 1 . Summary of datasets used in the study. The
table shows, in different columns: the ID of the dataset (Dataset#), the Pearson correlation is the linear correlation between two sets, dividing the covariance of two features by the product of each standard deviation.Bray-Curtis similarity is the comparison of the composition between two different sites, dividing twice the sum of the lesser value for only those features in common between two sites by the total sum of values counted at both sites for all the features.Assume   as  ℎ (< ) feature value of  ℎ (< ) data sample.Then we can define the list of  ℎ and  ℎ feature value as  = [ 1 ,  2 , …   ] and  = [ 1 ,  2 , …   ] .Pearson correlation and Bray Curtis this paper, we used either Pearson correlation [REF40#] or Bray-Curtis similarity [REF41#].

Pearson Correlation and Bray Curtis Dissimilarity is
identical to Eq. (1) and Eq.(2), but  and  are now comprised of rank of each feature at each bootstrap where   represent the rank of  ℎ ( = 1, … , # ) feature of  ℎ ( = 1, … , 100) bootstrap.Bray Curtis Dissimilarity is calculated by subtracting Bray Curtis similarity value from one.
True Positive, TN of True Negative, FN of False Negative and FP of False Apart from Linear SVM which outputs the class, predictive values should be converted into binary values by establishing a threshold.Considering the positive to negative ratio was not high (with a value of 0.81 as indicated in Table were also measured to compare models' performance.AUC stands for area under the ROC curve, and the ROC curve is drawn by placing False Positive Rate,  + .

Table 2 . Stability metrics for each Recursive Feature Elimination experiment
. For each dataset, the best stability value is highlighted in bold.

Table 3 . Noise filtering result for each Recursive Feature Elimination experiment.
Table 3, except ED1 in species level, mapping with Bray-Curtis similarity shows better noise filtering score than the others.The superiority of mapping with Bray-Curtis similarity is more apparent at genus level, as the gap is triple bigger compared to species level The margin between the rank of noise features and original features is calculated.
The best pipeline for biomarker selection (Linear SVM based RFE with mapping using Bray-Curtis Similarity) is applied to the full dataset at species level without any split (i.e.training set is the combination of training sets in ED1 and ED2, test set of test sets in ED1 and ED2) to obtain the possible biomarkers for IBD.During the RFE process, we compute the average MCC across the 100 bootstraps internal test sets.Similar to Fig.3, average MCC slightly increases with the number of features and the maximum MCC is reached corresponding to 267 features among the 283 considered.The eight different predictive models presented in Section 2.5 are trained using the selected features, illustrated in Supplementary Table7.MLP-1,2 hidden layers consistently show the highest performance in terms of MCC, confirming what previously found.